{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<center>\n",
    "<img src=\"https://habrastorage.org/files/fd4/502/43d/fd450243dd604b81b9713213a247aa20.jpg\" />\n",
    "</center> \n",
    "     \n",
    "## <center>  [mlcourse.ai](https://mlcourse.ai) – Open Machine Learning Course \n",
    "\n",
    "Auteur: [Yury Kashnitsky](https://yorko.github.io) (@yorko). Edité et traduit par [Ousmane Cissé](https://github.com/oussou-dev). This material is subject to the terms and conditions of the [Creative Commons CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/) license. Free use is permitted for any non-commercial purpose. \n",
    "\n",
    "# <center>Mission #2. Automne 2019\n",
    "## <center> Partie 2. Gradient boosting"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Dans cette mission, vous êtes invité à faire mieux que la \"baseline\" (référence de base) dans la compétition [\"Flight delays](https://www.kaggle.com/c/flight-delays-fall-2018).**\n",
    "\n",
    "Cette fois-ci, nous avons décidé de partager un scénario de base CatBoost plutôt correct. Vous devrez améliorer la solution fournie.\n",
    "\n",
    "Avant de vous lancer sur la mission, nous vous conseillons de consulter le matériel de cours correspondant:\n",
    " 1. [Classification, arbres-de-décision et k-plus-proches-voisins](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic03_decision_trees_kNN/topic3_decision_trees_kNN.ipynb?flush_cache=true), version interactive sur [Kaggle](https://www.kaggle.com/kashnitsky/topic-3-decision-trees-and-knn)\n",
    " 2. Ensembles:\n",
    "  - [Bagging](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part1_bagging.ipynb?flush_cache=true), version interactive sur [Kaggle](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-1-bagging)\n",
    "  - [Forêt aléatoire](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part2_random_forest.ipynb?flush_cache=true), version intercative sur [Kaggle](https://www.kaggle.com/kashnitsky/topic-5-ensembles-part-2-random-forest)\n",
    "  - [Feature Importance](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic05_ensembles_random_forests/topic5_part3_feature_importance.ipynb?flush_cache=true), version interactive sur [Kaggle](https://www.kaggle.com/kashnitsky/topic-5- ensembles-part-3-feature-importance)\n",
    " 3. [Gradient boosting](https://nbviewer.jupyter.org/github/Yorko/mlcourse_open/blob/master/jupyter_english/topic10_boosting/topic10_gradient_boosting.ipynb?flush_cache=true), version interactive sur [Kaggle](https://www.kaggle.com/kashnitsky/topic-10-gradient-boosting)\n",
    "  - Régression logistique, forêt aléatoire et LightGBM dans le cadre de la compétition \"Kaggle Forest Cover Type Prediction\": [Kernel](https://www.kaggle.com/kashnitsky/topic-10-practice-with-logit-rf-and-lightgbm) \n",
    " 4. Vous pouvez également vous exercer avec des missions de démonstration, plus simples et déjà partagées avec des solutions:\n",
    "  - \"Decision trees with a toy task and the UCI Adult dataset\": [mission](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees) + [solution](https://www.kaggle.com/kashnitsky/a3-demo-decision-trees-solution)\n",
    "  - \"Logistic Regression and Random Forest in the credit scoring problem\": [mission](https://www.kaggle.com/kashnitsky/assignment-5-logit-and-rf-for-credit-scoring) + [solution](https://www.kaggle.com/kashnitsky/a5-demo-logit-and-rf-for-credit-scoring-sol)\n",
    " 5. Il y a également 7 conférences vidéo sur les arbres, les forêts, la stimulation et leurs applications [mlcourse.ai/lectures](https://mlcourse.ai/lectures) \n",
    " 6. Tutoriels mlcourse.ai sur [categorical feature encoding](https://www.kaggle.com/waydeherman/tutorial-categorical-encoding) (par Wayde Herman) et [CatBoost](https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview) (par Mikhail Tribunskiy)\n",
    " 7. Dernier point important : [Kernels publics](https://www.kaggle.com/c/flight-delays-fall-2018/notebooks) liés à cette compétition\n",
    "\n",
    "\n",
    "### Votre mission consiste à:\n",
    " 1. battre**\"A2 de base (10 crédits)\"**sur Public LB (**0.75914 <**> score LB)\n",
    " 2. renommer votre [équipe](https://www.kaggle.com/c/flight-delays-fall-2018/team) en totale conformité avec A1 et le [classement du cours](https://docs.google.com/spreadsheets/d/15e1K0tg5ponA5R6YQkZfihrShTDLAKf5qeKaoVCiuhQ/) (à paraître le 16.09.2019)\n",
    " \n",
    "Cette tâche est censée être relativement facile. Ici, vous n'êtes pas obligé de télécharger votre solution reproductible.\n",
    " \n",
    "### <center> Date limite pour A2: le 6 octobre 2019 à 20h59 (heure de Londres)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import warnings\n",
    "from pathlib import Path\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "from catboost import CatBoostClassifier\n",
    "from sklearn.metrics import roc_auc_score\n",
    "from sklearn.model_selection import train_test_split"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Lire les données**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PATH_TO_DATA = Path(\"../input/flight-delays-fall-2018/\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df = pd.read_csv(PATH_TO_DATA / \"flight_delays_train.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df = pd.read_csv(PATH_TO_DATA / \"flight_delays_test.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "test_df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Créer une seule caractéristique - \"flight\" (que vous devez améliorer - ajouter plus de caractéristiques)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "train_df[\"flight\"] = train_df[\"Origin\"] + \"-->\" + train_df[\"Dest\"]\n",
    "test_df[\"flight\"] = test_df[\"Origin\"] + \"-->\" + test_df[\"Dest\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Se souvenir des index des caractéristiques catégorielles (à passer à CatBoost)**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "categ_feat_idx = np.where(\n",
    "    train_df.drop(\"dep_delayed_15min\", axis=1).dtypes == \"object\"\n",
    ")[0]\n",
    "categ_feat_idx"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Allouer un jeu de données de test (par exemple, un jeu de validation) pour valider le modèle**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train = train_df.drop(\"dep_delayed_15min\", axis=1).values\n",
    "y_train = train_df[\"dep_delayed_15min\"].map({\"Y\": 1, \"N\": 0}).values\n",
    "X_test = test_df.values"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train_part, X_valid, y_train_part, y_valid = train_test_split(\n",
    "    X_train, y_train, test_size=0.3, random_state=17\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Entraîner Catboost avec les arguments par défaut, en ne passant que les index des caractéristiques catégorielles.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ctb = CatBoostClassifier(random_seed=17, silent=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "ctb.fit(X_train_part, y_train_part, cat_features=categ_feat_idx);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ctb_valid_pred = ctb.predict_proba(X_valid)[:, 1]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Nous avons obtenu environ 0.756 ROC AUC sur les données de test (hold-out set).**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "roc_auc_score(y_valid, ctb_valid_pred)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "**Entraîner l'ensemble des données d'entraînement, faites des prévisions sur l'ensemble des données de test. Nous obtenons ~ 0,734 lors de la compétition - Niveau de (baseline) référence \"Catboost starter\"**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%time\n",
    "ctb.fit(X_train, y_train, cat_features=categ_feat_idx);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ctb_test_pred = ctb.predict_proba(X_test)[:, 1]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "with warnings.catch_warnings():\n",
    "    warnings.simplefilter(\"ignore\")\n",
    "\n",
    "    sample_sub = pd.read_csv(PATH_TO_DATA / \"sample_submission.csv\", index_col=\"id\")\n",
    "    sample_sub[\"dep_delayed_15min\"] = ctb_test_pred\n",
    "    sample_sub.to_csv(\"ctb_pred.csv\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!head ctb_pred.csv"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "lang": "fr"
   },
   "source": [
    "Maintenant c'est votre tour ! Lancez-vous et améliorez le modèle à battre **\"A2 base de référence (10 crédits)\"** - **0,75914** score LB. Il est crucial de proposer de bonnes caractéristiques.\n",
    "\n",
    "Pour les discussions, reférez-vous en au fil **#a2_kaggle_fall2019** du canal **mlcourse_ai_news** [ODS Slack](http://opendatascience.slack.com). Serhii Romanenko (@serhii_romanenko) sera là pour vous aider.\n",
    "\n",
    "Bienvenue sur Kaggle!\n",
    "\n",
    "<img src='https://habrastorage.org/webt/fs/42/ms/fs42ms0r7qsoj-da4x7yfntwrbq.jpeg' width=50%>\n",
    "*source [\"Nerd Laughing Loud\"](https://www.kaggle.com/general/76963).*"
   ]
  }
 ],
 "metadata": {
  "hide_input": false,
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.4"
  },
  "nbTranslate": {
   "displayLangs": [
    "*"
   ],
   "hotkey": "alt-t",
   "langInMainMenu": true,
   "sourceLang": "en",
   "targetLang": "fr",
   "useGoogleTranslate": true
  },
  "toc": {
   "base_numbering": 1,
   "nav_menu": {},
   "number_sections": false,
   "sideBar": false,
   "skip_h1_title": false,
   "title_cell": "Table of Contents",
   "title_sidebar": "Contents",
   "toc_cell": false,
   "toc_position": {
    "height": "427px",
    "left": "64px",
    "top": "180px",
    "width": "257px"
   },
   "toc_section_display": true,
   "toc_window_display": false
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
